Benchmarking long-read aligners and SV callers for structural variation detection in Oxford nanopore sequencing data

Structural variants (SVs) are one of the major types of DNA mutation and are typically defined as genomic alterations larger than 50 bp, including insertions, deletions, duplications, inversions, and translocations. These alterations can profoundly affect phenotypic characteristics and contribute to disorders such as cancer, to response to treatment, and to infections. Four long-read aligners and five SV callers were evaluated on three Oxford Nanopore NGS human genome datasets in terms of precision, recall, F1-score, depth of coverage, and speed of analysis. The best SV caller in terms of recall, precision, and F1-score, when paired with different aligners at different coverage levels, tends to vary depending on the dataset and the specific SV types being analyzed. However, based on our findings, Sniffles and CuteSV tend to perform well across different aligners and coverage levels, followed by SVIM and PBSV, with SVDSS in last place. The CuteSV caller has the highest average F1-score (82.51%) and recall (78.50%), and Sniffles has the highest average precision (94.33%). Minimap2 as an aligner and Sniffles as an SV caller form a strong base for an SV calling pipeline because of their high speed and reasonable performance. PBSV has a lower average F1-score, precision, and recall and may generate more false positives and overlook some true SVs. Our results provide insight into the performance of several popular long-read aligners and SV callers and serve as a reference for researchers selecting the most suitable tools for SV detection.


The selection of the validation datasets for SV calling
For benchmarking existing structural variant calling methods, it is preferable to use multiple datasets; accordingly, three datasets were used in this evaluation workflow. The first dataset was a real ONT dataset, in FASTQ format, sequenced on PromethION and released by the Genome in a Bottle (GIAB) Consortium for the NA24385 Ashkenazim individual (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V3.4.5/) (accessed on 3 September 2023). The GIAB Consortium also created benchmark SV calls and benchmark regions for this individual (https://ftp.ncbi.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz) (accessed on 3 September 2023). This "Truth set" is considered a resource of highly curated, high-quality variants and was published to the research community. SV calling methods have been released based on the hg19 coordinates.
The second dataset was a real ONT dataset, in FASTQ format, sequenced on MinION using a 1D ligation kit and obtained from the Nanopore repository (https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-genome/rel34.md) (accessed on 3 September 2023). The SV truth set for this dataset was generated by the Genome in a Bottle Consortium using the Pacific Biosciences (PacBio) platform and was used in this manuscript as the corresponding SV truth set for the NA12878 dataset. The analysis only included SV calls with a "PASS" flag in the "FILTER" field (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NA12878_PacBio_MtSinai/NA12878.sorted.vcf.gz).
The last dataset was a synthetic ONT dataset, referred to as SI00001, generated using the SV simulator VarIant SimulatOR (VISOR) (https://github.com/davidebolo1993/VISOR) (accessed on 3 September 2023), following the simulation instructions for generating ONT long reads, and was simulated to 50X coverage 32. The VISOR was

Read mapping and structural variant calling for datasets
The reads of the three datasets were aligned to the public human genome build GRCh37/UCSC hg19 using four long-read aligners: Minimap2 33 (v2.26), NGMLR 34 (v0.2.7), LRA 31 (v1.3.7.2), and pbmm2 (https://github.com/PacificBiosciences/pbmm2) (v1.7.0) (Table 1). The reads were aligned to this earlier version of the human reference genome because the "Benchmark set" for NA12878 and the "Truth set" for NA24385, which were later used as the benchmark references for this evaluation, are on hg19. The SV benchmark set simulated with VISOR was also generated on the hg19 build to unify the reference genome build. Each alignment produced a Sequence Alignment Map (SAM) file, which was then converted to Binary Alignment Map (BAM) format using Samtools 35. The resulting BAM file was sorted and indexed with Samtools to prepare it for variant calling. Mosdepth was used to calculate the coverage after sorting and indexing the generated alignments 36.
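The mapping steps described above (align, convert SAM to BAM, sort, index, compute coverage) can be sketched as a list of command lines. This is a minimal illustration rather than the exact commands used in the study: the file names, the thread counts, and the choice of the Minimap2 `map-ont` preset are assumptions, and the other three aligners would need their own invocations.

```python
import shlex


def alignment_commands(reference, fastq, prefix, threads=64):
    """Build the align -> BAM -> sort -> index -> coverage command lines.

    All file names and option values are illustrative assumptions.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        # Align ONT reads to the reference and write SAM output.
        ["minimap2", "-ax", "map-ont", "-t", str(threads), "-o", sam, reference, fastq],
        # Convert SAM to BAM.
        ["samtools", "view", "-b", "-o", bam, sam],
        # Coordinate-sort the BAM file.
        ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, bam],
        # Index the sorted BAM file for random access by the SV callers.
        ["samtools", "index", sorted_bam],
        # Summarise depth of coverage without per-base output.
        ["mosdepth", "-n", "-t", "4", prefix, sorted_bam],
    ]


if __name__ == "__main__":
    for cmd in alignment_commands("hg19.fa", "reads.fastq", "NA24385.minimap2"):
        print(shlex.join(cmd))
```

The commands are only printed here; passing each list to `subprocess.run(cmd, check=True)` would execute them if the tools are installed.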

Enhancing the SV calling accuracy
To enhance SV calling accuracy, a tandem repeat Browser Extensible Data (BED) file corresponding to the hg19 reference (https://raw.githubusercontent.com/PacificBiosciences/pbsv/master/annotations/human_hs37d5.trf.bed) (accessed on 3 September 2023) was downloaded and used during the variant calling process. Although Sniffles, SVIM, CuteSV, and PBSV can detect all kinds of SVs, NpInv was designed specifically to detect inversions accurately. Detection of inversions (INV) was outside the scope of the current evaluation, but it was still performed to lay the groundwork for a future assessment of SV callers on accurate inversion detection.

Filtering for the SV callset
Several filtering steps were applied to generate comparable datasets. SV calls on independent consensus sequences or contigs and on the mitochondrial genome were filtered out, leaving only insertions, duplications, and deletions in each call set. For comparison, insertion and duplication calls were combined into one category ("insertions"). The SVs were then filtered for length >= 50 bp, and only SV calls with a "PASS" flag in the "FILTER" column were retained for the next step of the analysis. The performance of SV detection tools is challenging to evaluate because there is no standard technique for precisely identifying SVs in the human genome. The "Truth set"/"Benchmark set" Variant Call Format (VCF) files corresponding to the three datasets from GIAB and VISOR were used to address this limitation. The output VCFs of the five SV callers were then compared to this "Truth set"/"Benchmark set" VCF in terms of precision, recall, and F1-score using the toolkit Truvari (Table 1), to assess the impact of sequencing settings on the SVs generated by each tool and how close they are to the "Truth callset". Candidate SVs missing from the truth set were counted as false positives, and truth-set SVs missing from the candidate calls were counted as false negatives.
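The filtering rules above can be illustrated on single VCF records. The sketch below is not the authors' script: it assumes that each caller reports `SVTYPE` and `SVLEN` in the INFO column (standard VCF structural variant keys), that chromosomes are named 1-22, X, and Y with an optional `chr` prefix, and the function names are hypothetical.

```python
# Chromosomes retained for evaluation; contigs and the mitochondrial genome are dropped.
CANONICAL = {str(i) for i in range(1, 23)} | {"X", "Y"}


def parse_info(info):
    """Turn a VCF INFO string such as 'SVTYPE=DEL;SVLEN=-120' into a dict."""
    parsed = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def keep_record(line, min_len=50):
    """Return 'INS' or 'DEL' if the record passes every filter, else None."""
    fields = line.rstrip("\n").split("\t")
    chrom, filt, info = fields[0], fields[6], parse_info(fields[7])
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    if chrom not in CANONICAL:
        return None  # unplaced contig or mitochondrial genome
    if filt != "PASS":
        return None
    svtype = info.get("SVTYPE", "")
    if svtype == "DUP":
        svtype = "INS"  # duplications are merged into the insertion category
    if svtype not in {"INS", "DEL"}:
        return None
    try:
        svlen = abs(int(info.get("SVLEN", "0").split(",")[0]))
    except ValueError:
        return None  # missing or non-numeric length
    return svtype if svlen >= min_len else None
```

Header lines (those starting with `#`) would need to be skipped before calling `keep_record`, and callers that encode length differently (for example through `END`) would need an adapted length calculation.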

Alignment of ONT datasets using long-read aligners and corresponding truth SV call sets
For the NA24385 dataset, the GIAB consortium's ultra-long ONT FASTQ files were used for the evaluation after their retrieval from the NCBI repository. The initial total coverage was found to be 45X and was down-sampled to depths of coverage of 30X, 20X, and 10X. The truth callset has a large number of deletions and insertions, produced from various sequence lengths and visual charting, for the same individual on the GRCh37 genome. The NA24385 truth SV callset has 9641 SVs (with FILTER "PASS"), with 5260 insertions and 4381 deletions (Fig. 1).
For the NA12878 dataset, the FASTQ file generated by the nanopore whole-genome sequencing consortium was used for the alignment process. Both the reported and the calculated depth of coverage were ~30X. It was then down-sampled to 20X and 10X coverage only. The corresponding SV truth set was the one created by the Genome in a Bottle Consortium using the Pacific Biosciences (PacBio) platform. There are 10,135 SVs in the NA12878 Benchmark callset (with FILTER "PASS"), with 5783 insertions and 4352 deletions (Fig. 1).
The synthetic ONT dataset SI00001 was simulated using the SV simulator VISOR at a depth of coverage of 50X. The SV "Benchmark set" used for this dataset included 10,676 randomly generated SVs, comprising 5,027 deletions, 5,027 insertions, and 300 inversions, among other types of structural variants such as duplications and translocations (Fig. 1). The SI00001 aligned BAM file was down-sampled to 30X, 20X, and 10X depth of coverage.
Generally, each aligner performed equally across the three datasets. In terms of time consumed, Minimap2 was the fastest of the four aligners (8 h), followed by LRA (14 h) and Pbmm2 (15 h), whereas NGMLR was the slowest (59 h). The alignment was done on a machine with 128 GB of RAM and 64 threads. The performance of the four aligners is reported as the time taken to finish the alignment, the CPU time in hours, the wall clock time, and the memory usage in gigabytes (Table 2). The metrics for the BAM files generated by the four aligners were deposited in the GitHub repository (https://github.com/AnkhBioinformatics/SVcallers_Comparisons).
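The down-sampling of each dataset to lower depths amounts to keeping a fraction of reads equal to the target coverage divided by the original coverage. The sketch below computes that fraction and builds a Samtools command around it; the file names and the seed are assumptions, and the `--subsample` and `--subsample-seed` options are assumed to be available in the installed Samtools release (older releases expose the same behaviour through `-s`).

```python
def subsample_fraction(original_cov, target_cov):
    """Fraction of reads to keep so that coverage drops from original to target."""
    if original_cov <= 0 or not 0 < target_cov <= original_cov:
        raise ValueError("target coverage must be positive and not exceed the original")
    return target_cov / original_cov


def subsample_command(bam, out_bam, fraction, seed=1):
    """Build a samtools command that keeps roughly `fraction` of the alignments."""
    return [
        "samtools", "view", "-b",
        "--subsample", f"{fraction:.4f}",
        "--subsample-seed", str(seed),
        "-o", out_bam, bam,
    ]


if __name__ == "__main__":
    # NA24385 started at 45X and was reduced to 30X, 20X, and 10X.
    for target in (30, 20, 10):
        frac = subsample_fraction(45, target)
        cmd = subsample_command("NA24385.sorted.bam", f"NA24385.{target}x.bam", frac)
        print(f"{target}X", " ".join(cmd))
```

Because read sampling is random, the achieved depth is only approximately the target, so recomputing coverage on the down-sampled BAM is a sensible check.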

Evaluation of the different SV callers' performance in terms of precision, recall, and F-score values for SV calling of the NA24385, NA12878 and simulated SI00001 human genome datasets
The four chosen commonly used long-read SV callers (CuteSV, SVIM, Sniffles, and PBSV) were first tested against the publicly available ultra-long nanopore reads of the truth set NA24385 at varying coverages.
In addition to that dataset, the NA12878 and SI00001 datasets were added to strengthen the evaluation of the SV callers' performance. It is worth mentioning that the SVcnn caller was initially considered for this evaluation but was later excluded because it was extremely time-consuming (80 h and 27.8 GB memory) and crashed repeatedly. All SV callers were pre-tuned to detect SVs of 50 bp and above to unify the parameters across callers. When filtering the output VCF generated by each tool, only SVs with "PASS" in the FILTER field that lay on chromosomes 1-22, X, and Y were regarded as candidates for evaluating the results of the tools. Calls not matching any true variant were regarded as false positives. In contrast, false negatives were truth-set variants that were not present in the callset. For each combination of the mentioned aligners and SV callers, we assessed the precision, recall, and F1-score of the detected SVs. Each tool's SV calls were marked "true" or "false" according to whether they matched the corresponding Truth/Benchmark callset. The output of the comparison was a report including the precision, recall, and F1-score of the obtained high-quality SV callsets. This helped us evaluate the quality of the SV calls for each tool as well as the performance of each tool in terms of CPU time in hours, wall clock time, and memory usage in gigabytes, which is presented in Table 3.
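The three metrics follow directly from the counts of true positives, false positives, and false negatives defined above. The following is a minimal sketch using the standard definitions; the function name and the example counts are hypothetical and are not taken from the study's results.

```python
def benchmark_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts: 900 calls match the truth set, 100 calls do not,
    # and 300 truth-set variants were missed by the caller.
    p, r, f1 = benchmark_metrics(tp=900, fp=100, fn=300)
    print(f"precision={p:.2%} recall={r:.2%} F1={f1:.2%}")
```

Since F1 is the harmonic mean of precision and recall, a caller with very high precision but low recall still receives a modest F1-score.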
The precision, recall, and F-score values for SV calling (Sniffles, SVIM, CuteSV, and PBSV) following Minimap2, LRA, NGMLR, and Pbmm2 alignments at different depths of coverage are displayed in Tables 4, 5 and 6 for the NA12878 (Figs. 2, 3, 4, 5), NA24385 (Figs. 6, 7, 8, 9) and simulated SI00001 (Figs. 10, 11, 12, 13) human genome datasets, respectively. The benchmarking results for the three reference datasets, combined with four different long-read aligners (Minimap2, LRA, pbmm2, and NGMLR) and four different structural variant callers (CuteSV, Sniffles, PBSV, and SVIM), revealed that SV caller performance varies depending on the dataset and the specific SV types being analyzed. The average F1-score also increased with sequencing coverage, and Sniffles and CuteSV tended to perform well across different aligners and coverage levels, followed by SVIM and PBSV, with SVDSS in last place. The CuteSV caller has the highest average F1-score (82.51%) and recall (78.50%) of the five SV callers. CuteSV also scored the second-highest average precision value (78.50%). On average, the CuteSV caller has a CPU time of 4.044 h, a wall clock time of 102.3 min, and a memory usage of 3.4 GB across all aligners. The CuteSV caller relies on high-quality alignments to reliably call structural variations, which may affect its performance. It performs well across aligners and uses little CPU and memory. In addition, Sniffles has a CPU time of 4.227 h, a wall clock time of 121.3 min, and a memory usage of 5.1 GB across all aligners. Like CuteSV, Sniffles tends to perform relatively well across all aligners. SVIM's CPU time was 3.445 h, its wall clock time was 463.4 min, and its memory use was 3.405 GB. The two-step PBSV variant calling process has an average CPU time of 11.81 h and a wall clock time of 336.1 min, with a memory usage of 56.91 GB across all aligners. PBSV is explicitly designed for PacBio long-read data and can be computationally intensive. The three-step SVDSS variant calling process takes an average of 16.183 h of CPU time and 4:01:15 of wall clock time, with a memory usage of 70.723 GB across all aligners (Table 3).

Evaluation of the different SV callers' performance against the three datasets in terms of deletions and insertions
Each SV caller called different kinds of SVs in different numbers, the most common types being deletions and insertions. Because only a small number of SV types other than insertions and deletions were called, and some SV truth sets contain only insertions and deletions, the SV calls from all SV callers were put into two main groups: deletions (DEL) and insertions (INS). The current evaluation did not use other types of SVs in the call sets, such as inversions and translocations. The two callers SVDSS and SVIM consistently called a higher number of SVs than the other callers and tended to have a higher proportion of both deletions and insertions, which may explain the F1-score, precision, and recall values for these two tools. Sniffles and CuteSV tended to call fewer SVs than SVDSS and SVIM. PBSV called the fewest SVs across all aligners and levels of coverage, which may be due to its being designed for analyzing PacBio long-read data. Running NpInv on the three datasets at different coverage levels revealed that the number of inversions called by NpInv increases with coverage, which is expected given the increased sequencing depth and information available at higher coverage levels (Supplementary Table S1-S3). The results also suggest that the choice of aligner can affect the performance of NpInv. However, the differences in performance between the aligners are relatively small, and NpInv appeared to perform well with all the aligners tested. In terms of coverage level, the highest number of inversions was called at the 30X coverage level, followed by the 20X and 10X levels. The same trend in the three datasets indicated that the degree of coverage strongly affects NpInv (Supplementary Table S4-S6).

Evaluation of different SV callers' performance in terms of SV length and their performance in terms of precision, recall and F1-score
To comply with the definition of a structural variant, all SVs shorter than 50 bp were disregarded and filtered out in the filtration step. The SV count in each group, with the SV distribution across the different SV length ranges, is presented in detail in the supplementary tables (Supplementary S7-S9).
In general, CuteSV detected a significant number of SVs in the 50-250 bp range but none in the < 50 bp range. SVIM detected a large number of SVs in the 50-250 bp range and also had substantial detection in the < 50 bp range. PBSV showed consistent detection in the 50-250 bp and 251-500 bp ranges. SVDSS had the highest total number of SVs detected, with a significant number in the < 50 bp and 50-250 bp ranges.
At total coverage: Sniffles detected the lowest number of SVs in the < 50 bp range and the highest number of SVs in the 50-250 bp range. CuteSV, SVIM, PBSV, and SVDSS followed the general pattern described above.
At 30X coverage: Sniffles had a high number of detected variants in the 50-250 bp range, followed by the 251-500 bp and 501-750 bp ranges. CuteSV detected more variants in the 50-250 bp range, with very few in other ranges. SVIM had a significant detection rate in the < 50 bp range, followed by the 50-250 bp range. PBSV also had most variants in the 50-250 bp range, with fewer detected as the length increases. SVDSS had a very high number in the < 50 bp range, followed by a substantial count in the 50-250 bp range.
At 20X coverage: Sniffles, PBSV, CuteSV, and SVIM generally showed patterns similar to those seen at 30X coverage, with overall lower counts. SVDSS remained notably high in the < 50 bp range and lower in higher ranges.
At 10X coverage: Sniffles detected a significantly reduced number of variants in all ranges compared to 30X coverage. CuteSV detected fewer variants across all ranges, with zero in the < 50 bp range. SVIM detected a notably high count in the < 50 bp range with a steep drop-off in larger sizes. PBSV again showed a similar pattern, with a preference towards the 50-250 bp range. SVDSS still detected a substantial number in the < 50 bp range, markedly more than other callers at this coverage (Supplementary S7-S9).
The distribution and count of the detected SVs per SV length group were plotted as bar charts to give insight into the performance of the different variant callers versus the number of SVs detected per length range for the NA12878 (Supplementary Figures S1-S3), NA24385 (Supplementary Figures S4-S7) and SI00001 (Supplementary Figures S8-S11) datasets. The accuracy metrics (precision, recall and F1-score) across the different SV length groups were calculated for the most commonly studied reference sample, NA24385, as this will be valuable for future studies and evaluation.
For Minimap2 at total coverage: Sniffles showed varying performance across different SV length groups, with precision ranging from 47.01 to 72.80% and recall ranging from 38.14 to 77.21%. The F1-score ranged from 42.11 to 73.28%, indicating variability in its performance across different SV length categories.
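Grouping calls by length, as in the per-range counts above, can be sketched with a small binning helper. The first four ranges mirror those named in the text; the open-ended ">750" group is an assumption made only to keep the example complete, since the upper ranges are defined in the supplementary tables, and the function names are hypothetical.

```python
from collections import Counter

# (lower bound, upper bound, label); bounds are inclusive and in base pairs.
LENGTH_BINS = [
    (0, 49, "<50"),
    (50, 250, "50-250"),
    (251, 500, "251-500"),
    (501, 750, "501-750"),
]


def length_group(svlen):
    """Return the length-range label for an SV length (deletions may be negative)."""
    size = abs(svlen)
    for lower, upper, label in LENGTH_BINS:
        if lower <= size <= upper:
            return label
    return ">750"  # assumed catch-all group for longer SVs


def count_by_length(svlens):
    """Count SVs per length range."""
    return Counter(length_group(length) for length in svlens)


if __name__ == "__main__":
    # Hypothetical SVLEN values from one caller's VCF.
    print(count_by_length([-120, 35, 300, -620, 4800, 75]))
```

Computing precision, recall, and F1-score per range then only requires splitting both the call set and the truth set with the same `length_group` function before comparison.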
CuteSV demonstrated consistently high precision, recall, and F1-score across all SV length groups, with values ranging from 82.98 to 94.73% for precision, 94.63 to 97.45% for recall, and 88.71 to 95.03% for F1-score. This indicates strong and consistent performance in detecting SVs across different length categories at this coverage.
For Minimap2 at 10X coverage: SVDSS demonstrated varying performance across different SV length groups, with precision ranging from 90.60 to 99.73%, recall ranging from 90.15 to 99.19%, and F1-score ranging from 90.37 to 99.46%. Sniffles showed varying performance, with precision ranging from 93.13 to 97.05%, recall ranging from 79.65 to 94.97%, and F1-score ranging from 86.09 to 95.63%. CuteSV demonstrated consistently high precision, recall, and F1-score across all SV length groups, with values ranging from 98.92 to 99.47% for precision, 98.94 to 99.12% for recall, and 99.02 to 99.20% for F1-score. SVIM showed varying performance, with precision ranging from 95.08 to 99.65%, recall ranging from 94.79 to 98.88%, and F1-score ranging from 94.93 to 99.26%. PBSV demonstrated relatively high precision, recall, and F1-score across different SV length groups, indicating consistent performance in detecting SVs of varying lengths at 10X coverage.
The SV callers' performance with LRA, NGMLR, and Pbmm2 was the same as with Minimap2 where CuteSV demonstrated consistently high precision, recall, and F1-score across all SV length groups and coverage levels,

Discussion
Most previous studies focused on the detection of single-nucleotide polymorphisms (SNPs) because they are easier to track down using existing sequencing tools and algorithms 39. The well-recognized prevalence of SVs over the last 20 years has shifted our viewpoint on their impact on genomic disorders 40. Despite all these indications of the importance of SVs, they have received far less attention than SNVs because they are difficult to detect. In theory, each type of SV produces a distinct pattern in the mapped reads that can be used to deduce the underlying variation 40. Multiple SVs can overlap or cluster together, resulting in more intricate patterns than when they occur separately. Such complex patterns may impede mapping entirely, forcing investigators to rebuild such genomic experiments and analyses from scratch 27,41.
With the introduction of long-read sequencing technology, specifically Pacific Biosciences (PacBio) and ONT, it has become possible to produce reads of thousands of base pairs 19,29. Because of different DNA library preparations, the various platforms produce different kinds of information 42,43. As previously reported, the primary distinctions between these types of reads are their length and error rate 44. Furthermore, assembly-based methods can be utilized for SV detection. It is difficult to assess the performance of SV detection tools because of the absence of a reference scheme for precisely identifying such SVs. To address this limitation, the Genome in a Bottle (GIAB) Consortium recently released a sequence-resolved benchmark set for SV detection 45. We used the long-read nanopore sequencing data for sample NA24385 deposited on the NCBI FTP site to build an accurate model for the assessment of SV detection algorithms and to create a pipeline that supports SV detection by choosing the aligner and SV caller that best fit the existing benchmark set available from GIAB 44,45. The FASTQ files of the NA24385 and NA12878 samples, retrieved from the NCBI repository and the nanopore whole-genome sequencing consortium repository, as well as the FASTQ of the simulated dataset SI00001, generated as per the instructions provided in this repository (https://github.com/davidebolo1993/EViNCe/tree/main/SI00001) (accessed on 3 September 2023), were aligned to the GRCh37 reference genome using four of the most common long-read aligners: Minimap2, LRA, NGMLR, and Pbmm2, a SMRT wrapper for Minimap2 developed for PacBio data. To evaluate the impact of sequencing depth on SV calls, subsets were created by down-sampling the original datasets to 30X, 20X, and 10X sequencing coverage using Samtools, and, using the benchmarking tool Truvari, we calculated the F1-score, precision, and recall for each of the studied SV callers at each coverage level. We put five general-purpose SV callers to the test: Sniffles 39, SVIM 4,19, CuteSV 30, PBSV, and SVDSS 37, as they can detect all SV types from long-read alignments, with the exception of SVDSS, which was developed to detect insertions and deletions only and is not yet customized to detect inversions. Currently, ONT recommends Sniffles2 as the go-to SV caller, and it has been integrated as the SV caller of choice in the variant detection pipeline, along with Clair3 for SNV/indel detection.
The Sniffles2 caller detects all types of SVs and can be used with any aligner, particularly with Minimap2. As per the recommendation of ONT, this combination was used as the base of the two Nextflow-based workflows that manage compute and software resources, as previously reported 46,47. After mapping reads to the reference genome, the program detects split-reads and read-pairs that span the potential SV breakpoints. Sniffles2 clusters breakpoint-spanning reads and uses a probabilistic algorithm to identify the most likely SV type and breakpoints 39, while the CuteSV caller collects SV signatures using customized approaches and analyzes them using a clustering-and-refinement process to detect SVs sensitively. The CuteSV caller outperformed state-of-the-art techniques in yield and scalability on PacBio and ONT datasets. Furthermore, the CuteSV caller uses split-read and read-pair information to detect SVs. After mapping reads to the reference genome, the tool groups split-reads and read-pairs that support SV breakpoints. The CuteSV caller then uses graphs to determine the most likely SV type and breakpoints 30.
Meanwhile, SVIM calls structural variants in third-generation sequencing reads and identifies and classifies most genetic mutations or changes by integrating genome-wide data. SVIM uses de novo assembly to generate contigs spanning potential SV breakpoints. It outperformed competing approaches on simulated and real PacBio and nanopore sequencing data. It combines split-read and read-pair information with de novo insertion event assembly to identify SVs. The SV breakpoints are identified by mapping reads to the reference genome. SVIM then generates contigs spanning these breakpoints using a de novo assembler and aligns them to the reference genome to determine the most likely SV type and breakpoints 19. PBSV is a variant calling software developed by PacBio to detect structural variants in long-read PacBio sequencing data. It aligns long reads to a reference genome using a long-read aligner and identifies structural variants using split reads, where discordant read pairs indicate an SV. PBSV clusters discordant read pairs and finds the most likely SV type and breakpoints using a graph-based technique. PBSV clusters these variants and filters out false positives to identify complex and large structural variants that are hard to distinguish using short-read sequencing data (PacificBiosciences/pbsv, 2022). It is the most useful SV caller for the detection of insertions ranging from 20 bp to 10 kb, deletions ranging from 20 bp to 100 kb, inversions ranging from 200 bp to 10 kb, and duplications ranging from 20 bp to 10 kb 44. On the other hand, SVIM employs a graph-based technique to discover signature clusters and final SVs, with each node representing an SV signature, and is known to perform best with PacBio HiFi reads 13. PBSV's precision in calling SVs was much better than its recall across the different coverage datasets. Still, overall, its recall and precision were much lower than those reported by other tools. However, in other studies, its performance was better than that of Sniffles 19. This may be due to a difference in the dataset and the aligner used for benchmarking.
SVDSS is designed to identify SVs in hard-to-call genomic regions using long-read sequencing data and sample-specific strings. SVDSS requires a FASTA format reference genome for sample genotyping. It involves building an FMD index, smoothing the input BAM file, extracting SFS, assembling SFS into superstrings, and calling SVDSS to genotype SVs. It incorporates both split-read and soft-clipping analysis, clustering, and machine learning algorithms to improve accuracy 37. Inversions are structural variations in which a segment of DNA is flipped so that the sequence is reversed compared to the reference genome. NpInv is the tool of choice for detecting inversions from long-read sequencing data. It works by analyzing the alignment of long-read sequencing data to a reference genome 48. NpInv uses a unique approach to detect inversions. It first identifies regions where the long-read sequencing data spans two regions of the reference genome in an orientation inconsistent with the reference genome. Then, it looks for a breakpoint, which is a location where the sequence in the long-read data abruptly changes orientation. Finally, NpInv uses a statistical model to determine whether the orientation change is consistent with an inversion 48. NpInv is better than other inversion detection tools, such as SVIM, Sniffles, and CuteSV, in several ways. Firstly, NpInv is designed specifically for detecting inversions, whereas other tools are designed to detect a broader range of structural variations. This means that NpInv is optimized for detecting inversions and may be more sensitive and specific for this type of structural variation 4,48. Secondly, NpInv is designed to work with long-read sequencing data, which is typically more informative than short-read sequencing data. Long-read sequencing data allows NpInv to span the breakpoints of inversions, which can be challenging to detect with short-read sequencing data 48.
Based on the results of the performance of different SV callers with the Minimap2 aligner at different coverage depths, we can see that both Sniffles and CuteSV have the highest F1-scores across all coverage depths. The PBSV caller also has a high F1-score but with lower precision. SVIM has a lower F1-score than the other callers, especially at lower coverage depths. SVDSS has the lowest F1-score, precision, and recall at all coverage depths. All callers perform relatively well at higher coverage depths (30X and 20X) with F1-scores above 90%. However, at lower coverage depths (10X), all callers except Sniffles have lower F1-scores, with SVDSS having the lowest F1-score of only 31.3%.
Regarding the performance of different SV callers with the LRA aligner at different coverage depths, we see that the CuteSV caller has the highest F1-score and recall at all coverage depths. The Sniffles caller has the highest precision but lower recall compared to the CuteSV caller. SVIM performs well with an F1-score above 90% at all coverage depths. PBSV has a relatively low F1-score and recall compared to the other callers. SVDSS has the lowest F1-score, precision, and recall at all coverage depths. All callers perform relatively well at higher coverage depths (30X and 20X) with F1-scores above 75%. However, at lower coverage depths (10X), all callers except CuteSV have lower F1-scores, with SVDSS having the lowest F1-score of only 30.31%.
The performance of different SV callers with the NGMLR aligner at different coverage depths shows that the CuteSV caller has the highest F1-score and recall at all coverage depths. Sniffles has the highest precision but lower recall compared to the CuteSV caller. SVIM performs well with an F1-score above 80% at all coverage depths. PBSV has a relatively low F1-score and recall compared to the other callers. SVDSS has the lowest F1-score, precision, and recall at all coverage depths. All callers perform relatively well at higher coverage depths (30X and 20X) with F1-scores above 70%. However, at lower coverage depths (10X), all callers except CuteSV have lower F1-scores, with SVDSS having the lowest F1-score of only 53.78%.
The performance of different SV callers with the Pbmm2 aligner at different coverage depths shows that SVIM has the highest F1-score, precision, and recall at all coverage depths. The CuteSV caller has a relatively low F1-score at all coverage depths but still performs better than Sniffles and PBSV. SVDSS has the lowest F1-score, precision, and recall at all coverage depths. All callers perform relatively well at higher coverage depths (30X and 20X) with F1-scores above 70%. However, at lower coverage depths (10X), all callers have lower F1-scores, with SVDSS having the lowest F1-score of only 27.27%.
After analyzing the precision, recall, and F1-score data of the different variant callers coupled with the Minimap2, LRA, NGMLR, and Pbmm2 aligners with respect to SV length, several trends and patterns emerge. CuteSV consistently demonstrates high precision, recall, and F1-score across all aligners, indicating robust performance in detecting structural variants (SVs) across different length groups and coverage levels. Sniffles exhibits competitive performance with varying precision and recall, especially for larger SVs, even though this variant caller was a top performer when tested on the unbinned reference. SVDSS consistently shows strong performance across aligners, with relatively high precision, recall, and F1-score in each SV length group, even though it showed very poor performance when tested on the unbinned reference, which lays the ground for future investigation of this behavior. SVIM demonstrates competitive performance in detecting SVs of various lengths at different coverage levels. PBSV exhibits relatively high precision, recall, and F1-score across different SV length groups, indicating consistent performance in detecting SVs. In conclusion, CuteSV emerges as a top performer across all aligners, demonstrating consistent and robust performance in detecting SVs. Sniffles shows competitive performance, especially for larger SVs. SVIM demonstrates competitive performance, while PBSV exhibits relatively high precision and recall. These findings suggest that the choice of aligner and variant caller can significantly affect the accuracy and sensitivity of SV detection.
Recall and precision fluctuate at coverages as low as 10X, indicating that such low coverages should not be used in routine structural variation calling; 20X appears to be the minimum coverage required to maintain the tools' performance as determined by the F1-score. The comparison metrics confirmed the usual tendency for higher sequencing depth to increase recall and precision, though the gains can be disproportionate depending on the tool. More permissive thresholds boost recall but decrease precision, whereas stricter cut-offs do the opposite. The precision and recall rates for each type of SV were studied. Each method worked best for deletions and insertions, which comprise most SVs in the human genome. Based on the results presented in this paper, both Sniffles and CuteSV consistently perform well across different aligners and coverage depths in terms of F1-score, precision, and recall. Sniffles should be preferred if high precision is required, while the CuteSV caller and Sniffles should be selected if high recall is needed. The Minimap2 aligner and Sniffles are recommended for preliminary analysis due to their high speed and stable performance for both insertions and deletions.
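The trade-off between permissive and strict thresholds can be made concrete with a small Python sketch. It uses the standard definitions of precision, recall, and F1-score; the `Call` record, the `support` field, the toy call set, and the two cut-offs are hypothetical and chosen only for illustration. The sketch also assumes that every true call matches a distinct truth-set SV.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    support: int   # number of supporting reads (hypothetical field)
    is_true: bool  # True if the call matches an SV in the truth set


def metrics(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def evaluate(calls, n_truth, min_support):
    """Filter calls by read support, then score them against n_truth true SVs."""
    kept = [c for c in calls if c.support >= min_support]
    tp = sum(c.is_true for c in kept)
    fp = len(kept) - tp
    fn = n_truth - tp  # assumes one-to-one matching of true calls to truth SVs
    return metrics(tp, fp, fn)


# Toy call set for illustration only: 10 true SVs exist in the truth set.
N_TRUTH = 10
TOY_CALLS = (
    [Call(10, True)] * 6 + [Call(3, True)] * 2
    + [Call(10, False)] * 1 + [Call(3, False)] * 4
)

if __name__ == "__main__":
    for cutoff in (2, 5):
        p, r, f1 = evaluate(TOY_CALLS, N_TRUTH, cutoff)
        print(f"min_support={cutoff}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

In this toy set, raising the cut-off discards mostly false calls at the cost of some true ones, so precision rises while recall falls; which setting gives the better F1-score depends on the data.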
In summary, the best-performing SV caller depends on the aligner and coverage depth used. The CuteSV caller consistently performs well across different aligners and coverage depths, with high F1-scores and recall. Sniffles has high precision but lower recall compared to CuteSV. SVIM performs well, with high F1-scores, precision, and recall at all coverage depths with the Pbmm2 aligner. PBSV has a relatively low F1-score and recall compared to other callers. SVDSS consistently has the lowest F1-score, precision, and recall at all coverage depths. Researchers should select the appropriate SV caller based on their specific data and research question, considering the aligner and coverage depth used. Recently, combining the results from multiple pipelines, such as Sniffles, CuteSV, and SVIM, was proposed as a possible approach to enhance the performance of the available SV callers, which can help reduce the overall false-positive rate 3. Moreover, various studies have investigated and evaluated the available variant calling tools for Oxford Nanopore sequencing in breast cancer 4,49, in the metagenomic discovery of secondary metabolites of various microorganisms 50,51, and for the detection of various plant pathogens 52.
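The idea of combining results from several callers can be sketched as a simple consensus filter that keeps an SV only when at least two callers report a compatible event. This is a simplified illustration and not the method of the cited work: the `SV` record, the `same_event` and `consensus` helpers, the 500 bp position tolerance, and the 0.7 length ratio are all assumptions.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    pos: int
    svtype: str  # e.g. "DEL" or "INS"
    svlen: int
    caller: str


def same_event(a, b, max_dist=500, min_len_ratio=0.7):
    """Heuristic match: same chromosome and type, nearby start, similar length.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    len_a, len_b = abs(a.svlen), abs(b.svlen)
    return min(len_a, len_b) >= min_len_ratio * max(len_a, len_b)


def consensus(calls, min_callers=2):
    """Greedily cluster calls and keep one representative per cluster
    that is supported by at least `min_callers` distinct callers."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.pos)):
        for cluster in clusters:
            # Compare against the first (left-most) member of each cluster.
            if same_event(cluster[0], call):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return [
        cluster[0]
        for cluster in clusters
        if len({c.caller for c in cluster}) >= min_callers
    ]


# Toy calls for illustration only.
TOY_CALLS = [
    SV("chr1", 1000, "DEL", -300, "Sniffles"),
    SV("chr1", 1020, "DEL", -310, "CuteSV"),
    SV("chr1", 5000, "INS", 120, "SVIM"),
    SV("chr2", 1000, "DEL", -300, "SVIM"),
]

if __name__ == "__main__":
    for sv in consensus(TOY_CALLS):
        print(sv)
```

Requiring agreement between callers tends to remove caller-specific false positives, but it can also discard true SVs that only one tool detects, so the `min_callers` setting trades precision against recall.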

Conclusions
The current study highlights how different aligners and coverage levels affect the performance of various SV callers, with their performance varying depending on the dataset being analyzed. The choice of aligner can significantly impact the performance of structural variant (SV) callers, with Minimap2 outperforming NGMLR and LRA in recall, precision, and F1-score, likely due to its ability to handle long reads. Lower coverage levels decrease SV callers' performance because fewer reads are available. The Sniffles and CuteSV callers perform well across different aligners and coverage levels, accurately identifying various SV types. Both SVIM and PBSV perform well in some cases but have more variable performance, with SVIM having lower recall and F1-scores and PBSV having high recall but lower precision at lower coverage levels. SVDSS consistently has the lowest F1-score, precision, and recall at all coverage depths. Based on these findings, SV callers such as Sniffles or CuteSV are recommended for preliminary data assessment because they achieve high accuracy, particularly on low-coverage data. Minimap2 as the aligner and Sniffles as the SV caller are suggested as the basis of an SV calling pipeline because of their high speed and reasonable performance in detecting genomic mutations such as insertions and deletions. Overall, our study provides a comprehensive evaluation of popular SV callers and aligners. It can serve as a reference for researchers in selecting the most suitable tools for their SV detection needs.

Figure 1. Distribution of the number of deletions (DEL) and insertions (INS) in the NA24385 truth set and the NA12878 and SI00001 benchmark sets.

Table 1. Summary of the tools used for SV calling, annotation, and benchmarking.

Table 2. Performance and resource consumption of the aligners in terms of running time and memory usage.

Table 3. SV callers' resource consumption and performance in terms of CPU time, wall-clock time, and memory usage. BAM Binary Alignment Map, LRA Long Read Aligner, NGMLR CoNvex Gap-cost alignMents for Long Reads, SV Structural Variant, SVIM Structural Variant Identification Method, PBSV Pacific Biosciences Structural Variant. Sniffles, CuteSV, SVIM, and PBSV are SV detection tools.

Table 4. The precision, recall, and F1-score values for SV calling for the NA12878 sample with Sniffles, SVIM, CuteSV, PBSV, and SVDSS following alignment with the four evaluated aligners (Minimap2, LRA, NGMLR, and Pbmm2) at different depths of coverage.

Table 5. The precision, recall, and F1-score values for SV calling for the NA24385 sample with Sniffles, SVIM, CuteSV, PBSV, and SVDSS following alignment with the four evaluated aligners (Minimap2, LRA, NGMLR, and Pbmm2) at different depths of coverage.

Table 6. The precision, recall, and F1-score values for SV calling for the SI00001 sample with Sniffles, SVIM, CuteSV, PBSV, and SVDSS following alignment with the four evaluated aligners (Minimap2, LRA, NGMLR, and Pbmm2) at different depths of coverage.